Summary

Analyze the following dataset - ‘Home Prices’ - about median home prices in various cities in the US. Do some trend analysis as well as descriptive statistics, ANOVA by region, and t-tests, for cities and regions.

Your code and plots should be embedded into your report, in a similar format as the other projects. Also, provide a concluding section at the end, interpreting the result of your analysis. Use this opportunity to practice professional reporting for your data analysis as well as practicing for your final project.

# initialization
# remove global environment
rm(list = ls())

par(mfrow=c(1,1))

dataFile = 'data/Home Prices.csv'
shhh <- suppressPackageStartupMessages
shhh(library(plotly))
shhh(library(tidyr))


### functions

# distribution plots for each year
distplot <- function (df, title) {
  h <- hist(df, breaks=10, col="grey", xlab="price", main=title) 
  xfit <- seq(min(df), max(df),length=40) 
  yfit <- dnorm(xfit, mean=mean(df), sd=sd(df)) 
  yfit <- yfit*diff(h$mids[1:2])*length(df) 
  lines(xfit, yfit, col="blue", lwd=2)
}

# returns a new df with either the mean or median of year values
getYearVals <- function(func='mean') {
  years <- housing[,-c(1,2,3)]
                  
  if (func=='mean'){
    mean<- sapply(years,FUN=mean)
    return (mean)
  }
  else {
    median<- sapply(years,FUN=median)
    return (median)
  }

  yearvals = list(y04, y05,y06,y07,y08,y09,y10,y11,y12)
  return (yearvals)
}

# returns a new df for the specified region with the mean values
getRegionalMeans <- function(regionName, df) {
  reg <- subset(df, df['Region'] == regionName)
  reg <- reg[,-c(1,2,3)]
  regValues <- sapply(reg,FUN=mean)
  return (regValues)
}

#returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# returs mean aggreation by year
getRegionAggByYear<- function (year) {
  if (year == '2012.p')
    y <- '2012'
  else
    y <- as.numeric(year)
  ag <- aggregate(housing[year], by=list(housing$Region), mean)
  colnames(ag) <- c("Region", "Mean")
  ag$year = c(y,y,y,y)
  return (ag)
}
# load data

options(width = 100)

housing <- read.csv(dataFile, skip = 2, header = T)
attach(housing, warn.conflicts=FALSE)

# replace NA with 0's
housing[is.na(housing)] = 0

# R prefixes numeric fields (i.e. years) with X. Lets remove them
names(housing) <- sub("^X", "", names(housing))

# remove first column which is just blank
housing = housing[,-c(1)]
# create a new df with year as a single column
prices <- housing %>% gather(year, price, c(4:12))

# update 2012.p to 2012
prices$year[prices$year == "2012.p"] <- "2012"

# make a copy for additional transformations
prices2 <- data.frame(prices)

# remove rows with 0 values
prices2 = prices2[prices2$price != 0, ]

#paste(prices2$Metropolitan.Area) # show what cities have white spaces, including leading/trailing
prices2$Metropolitan.Area <- trim(prices2$Metropolitan.Area) # trim them

#str (prices2)
#head(prices2,5)
#tail(prices2, 5)

# means aggreation by Region
y04 <- getRegionAggByYear('2004')
y05 <- getRegionAggByYear('2005')
y06 <- getRegionAggByYear('2006')
y07 <- getRegionAggByYear('2007')
y08 <- getRegionAggByYear('2008')
y09 <- getRegionAggByYear('2009')
y10 <- getRegionAggByYear('2010')
y11 <- getRegionAggByYear('2011')
y12 <- getRegionAggByYear('2012.p')
sprintf('housing dataset has %s rows and %s columns', nrow(housing), ncol(housing))
[1] "housing dataset has 163 rows and 12 columns"
head(housing, 3)
            Metropolitan.Area State.s.    Region  2004  2005  2006  2007  2008  2009  2010  2011
1                 Abilene, TX       TX Southeast   0.0   0.0   0.0   0.0   0.0   0.0 112.8 119.2
2                  Akron, OH        OH   Central 116.9 120.5 114.6 119.3 100.5  93.2 108.9  90.9
3 Albany-Schenectady-Troy, NY       NY Northeast 161.3 183.5 195.4 198.9 197.9 189.1 195.7 193.8
  2012.p
1  124.6
2  109.5
3  200.3

Descriptive Statistics

# summary, exclude factors
summary(housing[,-c(1,2,3)])
      2004            2005            2006            2007            2008            2009      
 Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
 1st Qu.:112.5   1st Qu.:119.3   1st Qu.:122.2   1st Qu.:122.8   1st Qu.:127.3   1st Qu.:119.2  
 Median :137.4   Median :147.6   Median :151.3   Median :154.0   Median :153.3   Median :142.9  
 Mean   :168.0   Mean   :189.1   Mean   :196.8   Mean   :198.7   Mean   :185.3   Mean   :167.5  
 3rd Qu.:190.2   3rd Qu.:233.2   3rd Qu.:238.1   3rd Qu.:241.2   3rd Qu.:223.4   3rd Qu.:195.4  
 Max.   :698.5   Max.   :744.5   Max.   :775.0   Max.   :836.8   Max.   :668.0   Max.   :596.2  
      2010            2011           2012.p     
 Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
 1st Qu.:121.3   1st Qu.:115.7   1st Qu.:124.0  
 Median :144.7   Median :139.8   Median :145.6  
 Mean   :173.9   Mean   :162.2   Mean   :170.0  
 3rd Qu.:196.0   3rd Qu.:182.6   3rd Qu.:187.4  
 Max.   :607.6   Max.   :597.0   Max.   :645.0  
par(mfrow=c(3,3)) 

distplot(housing$`2004`, '2004')
distplot(housing$`2005`, '2005')
distplot(housing$`2006`, '2006')
distplot(housing$`2007`, '2007')
distplot(housing$`2008`, '2008')
distplot(housing$`2009`, '2009')
distplot(housing$`2010`, '2010')
distplot(housing$`2011`, '2011')
distplot(housing$`2012.p`, '2012')

We see that home prices are positively skewed. This is typical for home prices since most houses are sold at near the mean amount, however a few number of houses will sell for much higher.

ANOVA/ Students T Test

Run one-way ANOVA for all regions against home price.

one.way<-aov(prices2$price ~ prices2$Region)
summary(one.way)
                 Df   Sum Sq Mean Sq F value Pr(>F)    
prices2$Region    3  5485071 1828357   213.7 <2e-16 ***
Residuals      1366 11687319    8556                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-stat of 213.7 and pvalue of nearly 0 indicates there is a significant difference in the group means (regional prices). This is not unexpected as we do know there are large home price differences regionally. The intercept is the mean for the Central region.

Run one-way ANOVA tests for all metro areas against home price.

one.way<-aov(prices2$price ~ prices2$Metropolitan.Area)
summary(one.way)
##                             Df   Sum Sq Mean Sq F value Pr(>F)    
## prices2$Metropolitan.Area  162 15799016   97525   85.71 <2e-16 ***
## Residuals                 1207  1373374    1138                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-stat of 85.7 and pvalue of nearly 0 indicates there is a significant difference in the group means (metropolitan area), but this is much smaller than for region. This is not unexpected as we do know there are price differences within different metropolitan groupings.

Run one-way ANOVA tests for all states against home price.

one.way<-aov(prices2$price ~ prices2$State.s.)
summary(one.way)
##                    Df   Sum Sq Mean Sq F value Pr(>F)    
## prices2$State.s.   67 12239589  182680   48.22 <2e-16 ***
## Residuals        1302  4932801    3789                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-stat of 48.22 and pvalue of nearly 0 indicates there is a significant difference in the group means (state) but this is much smaller than for region and metropolitan areas. This is not unexpected as we do know there are price differences within different states.

Run one-way ANOVA tests for all years against home price.

one.way<-aov(prices2$price ~ prices2$year)
summary(one.way)
##                Df   Sum Sq Mean Sq F value   Pr(>F)    
## prices2$year    8   365156   45644   3.696 0.000279 ***
## Residuals    1361 16807234   12349                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-stat is 3.696 and pvalue is nearly 0. This small F-stat suggests that year is much less significant in influencing home prices, however we do know that home prices do generally trend up over time, but in this sample (2004-2012), there were unusually events such as the housing crisis which impacted this.
Overall, we would say that Region, State and Metro Area are the highest influences on housing prices.

Student’s T Test)

Lets use the t-test to establish that home prices in San Diego-Carlsbad-San Marcos are different from home prices in San Jose-Sunnyvale-Santa Clara.

Null hypothesis: Home prices are the same between ‘San Diego-Carlsbad-San Marcos, CA’ and ‘San Jose-Sunnyvale-Santa Clara, CA’. Alternate hypothesis: Home prices are different between ‘San Diego-Carlsbad-San Marcos, CA’ and ‘San Jose-Sunnyvale-Santa Clara, CA’

#sj = ca[ca['met'] == 'San Jose-Sunnyvale-Santa Clara, CA']['price']

sd <- subset(prices2, prices2$Metropolitan.Area == 'San Diego-Carlsbad-San Marcos, CA')
sd <- sd[,-c(1,2,3,4)]  # only want prices

sj <- subset(prices2, prices2$Metropolitan.Area == 'San Jose-Sunnyvale-Santa Clara, CA')
sj<- sj[,-c(1,2,3,4)]  # only want prices

ttest <- t.test(sd, sj)
#names(ttest)
print (ttest)

    Welch Two Sample t-test

data:  sd and sj
t = -4.0884, df = 15.805, p-value = 0.0008771
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -310.16922  -98.20856
sample estimates:
mean of x mean of y 
 470.2778  674.4667 

There is a statistically significant difference between home prices between ‘San Diego-Carlsbad-San Marcos,CA’ and ’San Jose-Sunnyvale-Santa Clara, CA. The t statistic is -4.0884 with a very low pvalue . So we can reject the null hypothesis that home prices are the same between the two cities and accept the alternate hypothesis.

Conclusion

This anlaysis compared regional, state and city housing trends from 2004 to 2012. In doing so I had to overcome several issues. First there were data issues including many (NAN) values in price as well as leading/or trailing spaces in the Metropolitan Area names. The first step was to perform data cleansing, including removing leading/trailing spaces in string fields. Missing values were either replaced with 0’s or removed entirely depending on the analysis conducted. Furthermore, the data file had housing data separated by years as columns. While this structure was fine for doing trend analysis by year, it was not conducisive for general analysis across all years. For the Anonva/t-tests, I chose to transform the data into a different dataframe with all years included as a single column.(See following print out).

The anlysis affirmed housing trends which we intuitively know to be true; namely there are regional pricing differences with the West and Northeast having significantly higher home prices than Southeast and Central. Central was the lowest. Within the same geographical region, there are also price fluctions, for example by city. This analys compared CA home prices between the SF bay area vs. San Diego and it was found that home prices in the bay area is higher. Lastly, the analysis showed historical housing trends, that is housing prices tend to trend up overtime, and increased sharply after 2004 but decreased sharply in 2009, which is attributed to the subprime lending crisis, leading to the great recession of 2008.